Lesson: Using Text Data Processing 


Entity Extraction Enhancements 


The Entity Extraction transform can now identify the language of the input text when it is 
one of the 31 supported languages. Selecting Auto for the Language option, which is the 
default, enables this capability. When the language is identified, extraction is performed in 
that language, provided the corresponding language module is installed. 


Also, if any custom dictionaries and/or rules are specified, those containing the language 
name in the file name will automatically be applied, as well as any that don't contain a 
language name in the file name. This supports customizing a single Entity Extraction 
transform for multiple languages. As seen in the following figure, a new LANGUAGE output 
column is also available that indicates the language of each extraction found. 


When the input text language cannot be identified, a non-fatal error is logged. 
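
The file-name rule described above for custom dictionaries and rules can be illustrated with a short Python sketch. This is not the transform's implementation; it only mimics the stated behavior under simple assumptions. The language list is a hypothetical four-language subset of the 31 supported, the file names are invented for illustration, and the match is a plain case-insensitive substring test on the file name.

```python
from pathlib import Path

# Hypothetical subset of language names; the real transform supports 31.
KNOWN_LANGUAGES = {"english", "french", "german", "spanish"}


def languages_in_name(file_name: str) -> set[str]:
    """Return the known language names that appear in a file name."""
    stem = Path(file_name).stem.lower()
    return {lang for lang in KNOWN_LANGUAGES if lang in stem}


def applicable_files(file_names: list[str], detected_language: str) -> list[str]:
    """Keep files named for the detected language plus language-neutral files."""
    detected = detected_language.lower()
    selected = []
    for name in file_names:
        named = languages_in_name(name)
        # No language in the name: applies to every language.
        # Language in the name: applies only when it matches the detected one.
        if not named or detected in named:
            selected.append(name)
    return selected


# Invented file names, for illustration only.
files = [
    "products_english.nc",
    "products_french.nc",
    "company_names.nc",
    "complaints_german.fsm",
]
print(applicable_files(files, "french"))
```

With French detected, the sketch keeps the French-named file and the language-neutral one, which is how a single transform can carry customizations for several languages at once.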


TEMPLATE TABLE1 (DSDatastore.DBO) 

STANDARD_FORM    TYPE         OFFSET   LENGTH   LANGUAGE 
bonne volonté    NOUN_GROUP   148      13       french 
good men         NOUN_GROUP   301      8        english 
guten Menschen   NOUN_GROUP   473      14       german 

Figure 41: Auto Language Detection 
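
Because every extraction row now carries a LANGUAGE value, downstream logic can route or summarize results per language. The following Python sketch shows one way to do that. The rows are modeled on Figure 41, and the tuple layout and the function name `group_by_language` are assumptions made for this example, not part of Data Services.

```python
from collections import defaultdict

# Rows modeled on Figure 41:
# (STANDARD_FORM, TYPE, OFFSET, LENGTH, LANGUAGE)
rows = [
    ("bonne volonté", "NOUN_GROUP", 148, 13, "french"),
    ("good men", "NOUN_GROUP", 301, 8, "english"),
    ("guten Menschen", "NOUN_GROUP", 473, 14, "german"),
]


def group_by_language(extractions):
    """Collect the standard forms found for each detected language."""
    grouped = defaultdict(list)
    for standard_form, _type, _offset, _length, language in extractions:
        grouped[language].append(standard_form)
    return dict(grouped)


for language, forms in group_by_language(rows).items():
    print(language, forms)
```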


Text Data Processing now supports 13 different languages with predefined entity type 
support. The Dutch and Portuguese language modules support more than 30 predefined 
entity types that can be used out-of-the-box. 





While the Russian and Korean language modules already supported predefined entity types, 
we are expanding support to ensure consistent coverage across all of the 13 languages that 
provide predefined entity types. 


Sentiment analysis is now supported in Simplified Chinese. All five sentiment levels (Strong 
Positive, Weak Positive, Neutral, Weak Negative, and Strong Negative) are available, along 
with Major/Minor Problem and requests. Topics can be identified for each sentiment and request. 


Some customers use the Entity Extraction transform to identify profanity in text before 
customer service representatives send responses to questions or complaints. Dictionaries 
that identify profanity are now supported in French, German, and Spanish. Additionally, 
emoticons, social media jargon, and slang can now be identified within input text as 
sentiment. 


Data Services now supports pushing down processing of an HDFS delimited file format to 
Hadoop when it is connected to an Entity Extraction transform. 

Figure 42: Pushdown of Delimited File Formats to Hadoop 





Whether processing a CSV file or a weblog, the Entity Extraction MapReduce job processes 
the mapped input column directly within Hadoop instead of sending data back and forth to 
the Data Services Job Server, thus increasing efficiency. 